1 Background

This file describes the preliminary analyses of three test-concepts in the QLVLnewscorpora: penis, inleiding & hart. The concepts were selected from the full list of concepts (N = 433) that I collected from WordNet, Van Dale and DLP2. Information about the full set of concepts is available here:

2 Model parameters

Parameter selection was based on observations in Mariana's analyses of nouns & verbs, as well as on comments in the parameters Google doc. At this moment, the following parameter settings were used to construct token models:

parameter name           FOC                      SOC
Definition target type   lemma/pos                lemma/pos
Window size              fixed: 10                fixed: 4
Boundaries               sentence/none            none
cw selection: strategy   local/global             global
cw selection: settings   local:                   nav top-5000
                           * nav with freq > 200
                           * collfreq = 3
                           * ppmi > 1
                           * llr None or > 1
                         global:
                           * nav top-5000
Weighting                ppmi                     none

Of these, I plan to vary the boundaries (default: sentence) and the context word selection settings for FOCs. Specifically, I will compare implementing an LLR filter or not within the "local"1 strategy, as well as a local versus a global2 strategy. In the latter case, all top-5000 nav context words will be considered.
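The "local" selection thresholds in the table can be illustrated with a small sketch. The words, frequencies, and corpus size below are invented for illustration (not taken from the actual corpus), and collfreq = 3 is read here as a minimum co-occurrence frequency:

```python
import math

def ppmi(cooc, target_freq, cw_freq, total):
    """Positive PMI of a target/context-word pair from raw counts."""
    pmi = math.log2((cooc * total) / (target_freq * cw_freq))
    return max(pmi, 0.0)

def select_local_cws(candidates, total, min_cw_freq=200, min_collfreq=3, min_ppmi=1.0):
    """Filter candidate context words with the 'local' thresholds from the table:
    nav frequency > 200, co-occurrence frequency >= 3, PPMI > 1."""
    selected = []
    for cw, cw_freq, cooc, target_freq in candidates:
        if cw_freq <= min_cw_freq or cooc < min_collfreq:
            continue
        if ppmi(cooc, target_freq, cw_freq, total) > min_ppmi:
            selected.append(cw)
    return selected

# Toy counts: (context word, corpus frequency, co-occurrence frequency, target frequency)
candidates = [
    ("erectie", 500, 40, 1252),   # strongly associated -> kept
    ("de", 900000, 200, 1252),    # frequent but uninformative -> PPMI too low, dropped
    ("fluitje", 150, 5, 1252),    # below the nav frequency threshold -> dropped
]
print(select_local_cws(candidates, total=10_000_000))  # -> ['erectie']
```

With the global strategy, this per-target filter is simply replaced by the fixed top-5000 nav list.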

3 Concept 1: penis

This concept was selected because it is a difficult one, with many variants (N = 17, excluding constructions) and varying frequencies per variant.

variant frequency
ding/noun 80601
fluit/noun 1447
jongeheer/noun 105
lid/noun 107912
lul/noun 1155
mannelijkheid/noun 459
penis/noun 1252
piemel/noun 372
pik/noun 451
pisser/noun 4
plasser/noun 18
potlood/noun 1504
sjarel/noun 6
snikkel/noun 18
speer/noun 1217
tampeloeres/noun 1
zwengel/noun 42

This causes two problems for the models & analysis:

  • Some variants are not frequent enough for the token models3. As a result, these variants are not returned by the models. The variants that are too infrequent are: pisser/noun (N = 4), plasser/noun (N = 18), sjarel/noun (N = 6), snikkel/noun (N = 18) and tampeloeres/noun (N = 1).
  • Some variants are too frequent (specifically the polysemous variants ding and lid), which causes computational issues. In addition, even if all the tokens for these variants could be modelled, further steps in the analysis (e.g. clustering) would be problematic too, as the results would be biased towards the highly frequent variants.
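These two problems amount to partitioning the variants by frequency. A minimal sketch with the frequencies from the table above, where the cut-offs (20 and 50,000 tokens) are assumptions chosen for illustration, not values used by the actual models:

```python
# Hypothetical thresholds for illustration only; the real cut-offs depend on the
# model settings (see the footnote on nav frequency) and on computational limits.
MIN_TOKENS = 20       # below this: too few tokens to model reliably (assumption)
MAX_TOKENS = 50_000   # above this: computational cost and frequency bias (assumption)

frequencies = {
    "ding/noun": 80601, "fluit/noun": 1447, "jongeheer/noun": 105, "lid/noun": 107912,
    "lul/noun": 1155, "mannelijkheid/noun": 459, "penis/noun": 1252, "piemel/noun": 372,
    "pik/noun": 451, "pisser/noun": 4, "plasser/noun": 18, "potlood/noun": 1504,
    "sjarel/noun": 6, "snikkel/noun": 18, "speer/noun": 1217, "tampeloeres/noun": 1,
    "zwengel/noun": 42,
}

too_rare = sorted(v for v, f in frequencies.items() if f < MIN_TOKENS)
too_frequent = sorted(v for v, f in frequencies.items() if f > MAX_TOKENS)
modellable = sorted(set(frequencies) - set(too_rare) - set(too_frequent))

print(too_rare)       # pisser, plasser, sjarel, snikkel, tampeloeres
print(too_frequent)   # ding, lid
```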

A possible solution for the latter problem is to only sample the relevant tokens for the highly frequent types. This can be done in two ways:

  • determining the relevant context words for all variants for the penis-concept
  • determining the relevant context words for the most prototypical variant for the penis-concept (cf. cue validity). This would be the variant penis. While this strategy might be easier to implement, as penis is not a highly polysemous word, it may4 also be dangerous because you run the risk of excluding context words that only occur in particular contextual settings (e.g. jocular language).
    Note that it is at this point an open question to what extent this strategy will be necessary (and feasible) for all the concepts in the dataset. In addition, using this strategy implies that disambiguation needs to be done before constructing the final token model for all the variants. More specifically, we first need to figure out which model and which clustering algorithm gives the best semantic analysis of the concept if only the non-problematic variants are included. As a second step, we can then select the semantic space of the token cloud resulting from the best model to determine which FOCs can be considered as candidate FOCs for the problematic variants ding and lid.5

3.1 Selecting context words

3.1.1 Strategies for finding the best model

To find a way of extracting context words for the problematic variants, we need a token model for the non-problematic ones that performs well. The best model would be one that (1) has a good fit to the data (to avoid artificial effects, e.g. regional differences) and (2) has a (relatively) clear semantic region (or branch) where most observations for the target concept are located (precision), while out-of-concept tokens are located somewhere else (recall). As in other studies in the NephoSem project, determining what the best model is, is not straightforward. There are a number of procedures that can be considered:

  • manual disambiguation of all (or some of) the tokens (cf. Mariana's raters)
  • manual inspection of the token clouds (but precision vs. recall)
  • automatic disambiguation by also overlaying token vectors for an associated word6 of out-of-concept senses of polysemous items (e.g. papier or pen for potlood). This may complicate the analysis, as previous studies have shown that models perform better on certain tasks (e.g. synonyms or associated items) depending on the window size that is used. We can hopefully alleviate this problem by also including other monosemous variants to select context words (e.g. piemel, lul).
  • automatic disambiguation excluding tokens that have particular context words (cf. Dirk's example of scherp for potlood)
  • separation indices
  • manual inspection of frequent context words in clusters, excluding clusters that clearly don't have the target meaning
  • ...
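The strategy of excluding tokens with particular context words can be sketched as a simple blacklist filter. The tokens and cue words below are invented for illustration; they are not from the actual data:

```python
def exclude_by_context(tokens, blacklist):
    """Drop tokens whose context window contains a disambiguating cue word,
    e.g. excluding potlood tokens that co-occur with scherp (pencil sense)."""
    blacklist = set(blacklist)
    return [t for t in tokens if not blacklist & set(t["context"])]

# Invented tokens with their context words:
tokens = [
    {"id": "potlood/1", "context": ["scherp", "papier", "tekenen"]},  # pencil sense
    {"id": "potlood/2", "context": ["broek", "hard"]},
]
kept = exclude_by_context(tokens, blacklist={"scherp", "papier", "pen"})
print([t["id"] for t in kept])  # -> ['potlood/2']
```

The obvious limitation is the same as with cue-validity-based selection: a blacklist built from one sense risks throwing away in-concept tokens that happen to share a cue word.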

3.1.2 Models

So far, eight solutions with t-SNE clustering and three (four) models with NMDS have been constructed. All the token models have the following parameters:

parameter name           FOC                      SOC
Definition target type   lemma/pos                lemma/pos
Window size              fixed: 10                fixed: 4
Boundaries               sentence/none            none
cw selection: strategy   local/global             global
cw selection: settings   local:                   nav top-5000
                           * nav with freq > 200
                           * collfreq = 3
                           * ppmi > 1
                           * llr None or > 1
                         global:
                           * nav top-5000
Weighting                ppmi                     none

You can find a Shiny app to explore the models that have been analyzed so far here.

3.1.2.1 t-SNE-models

The t-SNE-solutions additionally vary according to two parameters:

  • Number of runs used to calculate the solution: 1000 or 5000. This is particularly useful for this dataset because the variants have very unequal frequencies (some are many times more frequent than others), so a stable solution may not be reached as fast.
  • Perplexity: 10, 20, 30, 50.

Overall, it looks like the more stable models are the ones with more runs and perplexity 30. Models with very low perplexity (perplexity = 10) look like they have too many small clusters. Choosing settings other than 'lemma' for the colors in the model plot shows that none of the lectal variables in the data seem to play a role.
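As an illustration of how the perplexity parameter is varied, here is a sketch with scikit-learn's t-SNE on random stand-in vectors (the actual models were not built with this code; in the real analysis the input would be the PPMI-weighted token vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for token vectors: 60 tokens in a 20-dimensional space.
vectors = rng.normal(size=(60, 20))

# One 2-d solution per perplexity setting; low perplexity emphasizes very local
# structure (many small clusters), higher perplexity a more global structure.
embeddings = {}
for perplexity in (10, 30):
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity, init="pca", random_state=0
    ).fit_transform(vectors)
    print(perplexity, embeddings[perplexity].shape)
```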

3.1.2.2 NMDS-models

I have tried four NMDS-solutions so far:

  1. k = 2, max number of random restarts = 207
  2. increasing the number of restarts to 500
  3. increasing both k and the restarts: k = 5, max number of random restarts = 100
  4. increasing the restarts some more: k = 5, max. number of random restarts = 250

The first NMDS-solution is really bad, with a high stress value (> 0.28), and it did not converge. The second solution ran for over twelve hours and was only at trial 178, with stress values comparable to the first solution (at this point, I killed the process). The third and fourth solutions are the best ones so far, with a stress value of 0.1334 in both cases (for the same trial), but still no convergence. The second dimension may be the one we're after, but it is not the case that all variants with the target meaning are at the bottom of the plot, nor that all out-of-concept variants are at the top.

Since we're running into problems with the NMDS-models, I analyzed where the problematic tokens are located. I used goodness() from library(vegan) to obtain a goodness-of-fit value per token:

goodness() finds a goodness of fit statistic for observations (points). This is defined so that sum of squared values is equal to squared stress. Large values indicate poor fit.

This plot shows the results for the fourth NMDS-solution. It shows that the less problematic tokens (lighter colours) are located at the top left of the plot, where the observations for fluit and potlood are located (typically with their prototypical meaning), as well as in the middle with the tokens for penis. Perhaps the model has more trouble with variants that are more polysemous.
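The idea behind the per-point statistic can be sketched as follows. This is only a metric approximation on random toy points: vegan's goodness() works with the monotone-regression fits rather than the raw dissimilarities, but the defining property is the same, namely that the squared per-point values sum to the squared stress:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def per_point_goodness(dissim, embedding):
    """Per-token goodness of fit: each point collects the squared residuals of
    its row, normalized so that the squared values sum to the squared (raw,
    metric) stress. Large values indicate poor fit."""
    d_hat = squareform(pdist(embedding))
    resid2 = (dissim - d_hat) ** 2
    denom = (dissim ** 2).sum()
    return np.sqrt(resid2.sum(axis=1) / denom)

rng = np.random.default_rng(1)
points = rng.normal(size=(10, 4))             # toy "tokens" in 4 dimensions
dissim = squareform(pdist(points))            # their pairwise dissimilarities
emb = points[:, :2]                           # a crude 2-d "embedding"

g = per_point_goodness(dissim, emb)
stress = np.sqrt(((dissim - squareform(pdist(emb))) ** 2).sum() / (dissim ** 2).sum())
print(np.isclose((g ** 2).sum(), stress ** 2))  # -> True
```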

3.1.2.3 Hierarchical clustering

Finally, I also used agglomerative hierarchical clustering (Ward's method) to analyze these data. Rather than choosing a number of clusters beforehand, I considered between 2 and 50 clusters, basing the optimal number of clusters on the silhouette width of the clusters. The optimal number of clusters is 45 in these data (sw = 0.358), with solutions that have 15 clusters or more reaching acceptable results (sw > 0.2). Obviously, solutions with 15 or more clusters are difficult to interpret, but for the purpose of illustration, this plot shows the solution with 15 clusters (isolate one cluster by double-clicking on its symbol in the legend). The x- and y-axis show the results from the t-SNE-solution with perplexity = 30 and 5000 runs. Some of the clusters make a lot of sense, especially when they are also the ones that are separated by t-SNE (e.g. the ouwe lul-cluster at the right of the plot, in magenta). Others are more diverse (e.g. clusters 2 and 5).

With fewer clusters, only some of the clearer divisions are (obviously) retained. Cluster 3 in the solution above, for instance, has body parts as context words. In the solution below, it is added to the more diverse cluster 2.
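The silhouette-based selection of the number of clusters can be sketched as follows, on toy data with three artificial sense clusters (Ward linkage via scipy, silhouette width via scikit-learn; the actual analysis scanned 2 to 50 clusters on the real token vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Toy token vectors in three well-separated, hypothetical sense clusters.
vectors = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 5)) for c in (0.0, 3.0, 6.0)])

Z = linkage(vectors, method="ward")

# Scan candidate numbers of clusters and keep the best average silhouette width.
best_k, best_sw = None, -1.0
for k in range(2, 11):
    labels = fcluster(Z, t=k, criterion="maxclust")
    sw = silhouette_score(vectors, labels)
    if sw > best_sw:
        best_k, best_sw = k, sw
print(best_k)  # -> 3 for this toy data
```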


  1. Defined in the Google doc as: "potentially all words within the specified window span around the target token". Note that my definition of "local" is not extreme, as I am only including navs with a frequency of > 200. However, it is local in the sense that potentially all these words can be considered (N = 37807).

  2. Defined in the Google doc as "fixed set of context words, same for all target types". Here the 5000 most frequent navs.

  3. This may also be related to the parameter settings that are used, e.g. if no navs of frequency > 200 occur with the target type in a particular token/observation, this type is not included in the model.

  4. Note that while it may be dangerous to use this strategy, it doesn't have to be. We just don't know yet.

  5. An alternative strategy may be to semasiologically analyze these variants. Specifically for ding this could be a fruitful approach, because this variant is highly polysemous and is also included as a high-level word in the WordNet-taxonomies. It is not known whether the penis-meaning of ding would show up in such an analysis.

  6. We could select high-frequency candidates from the association data of Gert Storms for this purpose.

  7. This determines how many times the algorithm can try to find a stable solution. If it doesn't succeed in the specified number of random starts, there is no successful convergence.